Character encodings in HTML
HTML has been in use since 1991, but HTML 4.0 (December 1997) was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: the integrity of the information, and universal browser display.
The document character encoding
When HTML documents are served, there are three ways to tell the browser which character encoding to use when displaying them to the reader. First, the web server can send an HTTP header along with each web page (HTML document). A typical HTTP header looks like this:
Content-Type: text/html; charset=ISO-8859-1
Second, for HTML (not usually XHTML), the document itself can include this information near its top, inside the HEAD element:
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
XHTML documents have a third option: expressing the character encoding in the XML declaration (the XML preamble), for example:
<?xml version="1.0" encoding="ISO-8859-1"?>
Each of these methods advises the receiver that the file being sent uses the specified character encoding. The character encoding is often referred to as the "character set", and it does limit which characters can appear directly in the raw source text. However, the HTML standard states that the "charset" is to be treated as an encoding of Unicode characters, and it provides a way to specify characters that the "charset" doesn't cover. The term code page is used in a similar sense.
It is a bad idea to send incorrect information about the character encoding used by a document. For example, a server where multiple users may place files created on different machines can't promise that all the files it sends will conform to the server's specification, because some users may have machines with different character sets. For this reason, many servers simply don't send the information at all, thus avoiding false promises. However, this can lead to the equally bad situation where the user agent displays the document incorrectly because neither the server nor the document has specified a character encoding.
The HTTP header specification supersedes all HTML (or XHTML) meta tag specifications, which can be a problem if the header is incorrect and the author doesn't have the access or the knowledge to change it.
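The precedence rule above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the regular expression is a deliberate simplification that looks for a charset parameter in the first kilobyte of the document rather than parsing HTML properly.

```python
import re
from email.message import Message
from typing import Optional

# Simplified pattern (an assumption for this sketch, not a real HTML parser).
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased name, or None when absent


def charset_from_meta(body: bytes) -> Optional[str]:
    """Return the charset named in the first meta declaration, if any."""
    match = META_CHARSET.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def declared_charset(content_type: Optional[str], body: bytes) -> Optional[str]:
    """The HTTP header wins; the meta declaration is only a fallback."""
    return charset_from_header(content_type) or charset_from_meta(body)


page = b'<head><meta http-equiv="Content-Type" content="text/html; charset=US-ASCII"></head>'
print(declared_charset("text/html; charset=ISO-8859-1", page))
print(declared_charset("text/html", page))
```

When both declarations are present, the header value is returned, which is exactly why an incorrect server header cannot be repaired from inside the document.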
Browsers receiving a file with no character encoding information must make a blind assumption. For Western European languages, it's typical and fairly safe to assume windows-1252 (which is similar to ISO-8859-1 but has printable characters in place of some control codes that are forbidden in HTML anyway), but it's also common for browsers to assume the character set native to the machine on which they're running. The consequence of choosing incorrectly is that characters outside the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for English-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In CJK environments, where several different multi-byte encodings are in use, auto-detection is often employed.
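A small Python sketch shows what a wrong guess looks like in practice. The sample strings are made up for illustration; the codec names are the ones Python's standard library uses.

```python
# Bytes written as UTF-8 but read as windows-1252: every non-ASCII
# character turns into a pair of unrelated characters.
raw = "caf\u00e9".encode("utf-8")          # the word "café" as UTF-8 bytes
wrong = raw.decode("windows-1252")
right = raw.decode("utf-8")
print(repr(wrong), repr(right))

# Byte 0x99 is where windows-1252 and ISO-8859-1 disagree: a printable
# trade mark sign in one, a C1 control code in the other.
print(repr(b"\x99".decode("windows-1252")))
print(repr(b"\x99".decode("iso-8859-1")))

# Pure ASCII text survives any of these guesses unchanged.
ascii_only = b"plain text"
print(ascii_only.decode("windows-1252") == ascii_only.decode("utf-8"))
```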
It is increasingly common for multilingual websites to use one of the Unicode/ISO 10646 transformation formats, as this allows use of the same encoding for all languages. Generally UTF-8 is used rather than UTF-16 or UTF-32 because it's easier to handle in programming languages that assume a byte-oriented ASCII superset encoding, and it's efficient for ASCII-heavy text (which HTML tends to be).
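The size argument is easy to check. The following Python sketch encodes a short, made-up fragment of markup (mostly ASCII tags around a Greek word) in the three transformation formats; the little-endian codec variants are used only so that no byte order mark is counted.

```python
# A hypothetical fragment: 17 ASCII characters of markup plus 5 Greek letters.
markup = '<p lang="el">\u03bb\u03cc\u03b3\u03bf\u03c2</p>'

sizes = {
    encoding: len(markup.encode(encoding))
    for encoding in ("utf-8", "utf-16-le", "utf-32-le")
}
for encoding, size in sizes.items():
    print(f"{encoding}: {size} bytes")

# UTF-8 is also an ASCII superset: the ASCII part encodes to identical bytes.
print("<p>".encode("utf-8") == "<p>".encode("ascii"))
```

The more of a page that consists of tags and attribute names, the larger UTF-8's advantage becomes; for text made up mostly of CJK characters the balance can shift toward UTF-16.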
Successful viewing of a page isn't necessarily an indication that its encoding is specified correctly. If the page's creator and reader are both assuming some machine-specific character encoding, and the server doesn't send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers with different native sets won't see the page as intended.
Character references
In addition to native character encodings, characters can also be encoded as character references, which can be numeric character references (decimal or hexadecimal) or character entity references. Character entity references are also sometimes referred to as named entities, or HTML entities for HTML. HTML's usage of character references derives from SGML.
Character entity references have the format &name; where "name" is a case-sensitive alphanumeric string. For example, the character 'λ' can be encoded as &lambda; in an HTML 4 document. The characters <, >, " and & are used to delimit tags, attribute values, and character references. The character entity references &lt;, &gt;, &quot; and &amp;, which are predefined in HTML, XML, and SGML, can be used instead for literal representations of those characters.
Numeric character references can be in decimal format, &#DD;, where DD is a variable-width string of decimal digits. Similarly there's a hexadecimal format, &#xHHHH;, where HHHH is a variable-width string of hexadecimal digits, though many consider it good practice to never use fewer than four hex digits, and never use an odd number of hex digits (due to the correspondence of two hex digits to one byte). Unlike named entities, hexadecimal character references are case-insensitive in HTML. For example, λ can also be represented as &#955;, &#x3BB; or &#X3bb;.
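The equivalence of these forms can be checked with Python's standard html module, which decodes character references using the HTML rules. The helper numeric_references is a hypothetical name introduced for this sketch; it pads the hexadecimal form to four digits, following the practice mentioned above.

```python
import html
from typing import Tuple


def numeric_references(ch: str) -> Tuple[str, str]:
    """Return the decimal and hexadecimal references for one character."""
    code_point = ord(ch)
    return f"&#{code_point};", f"&#x{code_point:04X};"


print(numeric_references("\u03bb"))  # references for the letter lambda

# Named, decimal, and both hexadecimal spellings decode to the same character.
for reference in ("&lambda;", "&#955;", "&#x3BB;", "&#X3bb;"):
    print(reference, "->", html.unescape(reference))

# Entity names are case-sensitive: &Lambda; is the capital letter.
print(html.unescape("&Lambda;"))
```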
Numeric references always refer to Universal Character Set code points, regardless of the page's encoding. Using numeric references that refer to UCS control code ranges is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F can't be used in an HTML document, not even by reference, so "&#153;", for example, isn't allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80–9F range are interpreted by some browsers as representing the characters mapped to bytes 80–9F in the Windows-1252 encoding.
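As a rough illustration, the forbidden ranges listed above can be written as a small predicate, and Python's html.unescape (which follows the HTML5 parsing rules) can stand in for a browser that applies the Windows-1252 compatibility mapping. The function name is_forbidden_reference is an assumption made for this sketch.

```python
import html


def is_forbidden_reference(code_point: int) -> bool:
    """True for the control ranges 00-08, 0B-0C, 0E-1F, 7F and 80-9F."""
    return (
        0x00 <= code_point <= 0x08
        or code_point in (0x0B, 0x0C)
        or 0x0E <= code_point <= 0x1F
        or 0x7F <= code_point <= 0x9F
    )


print(is_forbidden_reference(153))   # code point of the reference &#153;
print(is_forbidden_reference(0x0A))  # linefeed is one of the exceptions

# A lenient decoder maps the reference to the Windows-1252 character
# for byte 0x99, the trade mark sign U+2122.
print(html.unescape("&#153;") == "\u2122")
```

A validator would reject such a reference, while a lenient decoder silently repairs it, which is why pages containing it often still display as their authors expected.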
Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately then HTML character references are usually only required for a few special characters (or not at all if a native Unicode encoding like UTF-8 is used).
XML character entity references
Unlike traditional HTML with its large range of character entity references, in XML there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:
- &amp; → & (ampersand, U+0026)
- &lt; → < (less-than sign, U+003C)
- &gt; → > (greater-than sign, U+003E)
- &quot; → " (quotation mark, U+0022)
- &apos; → ' (apostrophe, U+0027)
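The five references in the list above are enough to escape arbitrary text for XML. A minimal Python sketch using the standard xml.sax.saxutils module follows; escape_xml and unescape_xml are hypothetical wrapper names, and the two quote characters are passed explicitly because the library escapes only &, < and > by default.

```python
from xml.sax.saxutils import escape, unescape

QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {reference: char for char, reference in QUOTES.items()}


def escape_xml(text: str) -> str:
    """Escape all five markup-sensitive characters."""
    return escape(text, QUOTES)


def unescape_xml(text: str) -> str:
    """Reverse escape_xml for the five predefined references."""
    return unescape(text, UNQUOTES)


sample = '<a href="x">Tom & Jerry\'s</a>'
escaped = escape_xml(sample)
print(escaped)
print(unescape_xml(escaped) == sample)
```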
All other character entity references have to be defined before they can be used. For example, use of &eacute; (which gives é, Latin small letter E with acute, U+00E9, in HTML) in an XML document will generate an error unless the entity has already been defined. XML also requires that the x in hexadecimal numeric references be in lowercase: for example &#xA1b; rather than &#XA1b;.
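Both restrictions can be observed with an XML parser. The sketch below uses Python's xml.etree.ElementTree, which is one parser among many; the helper name parses is hypothetical, and no DTD is supplied, so only the predefined entities are known to the parser.

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Report whether the text is accepted as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(parses("<p>&amp;</p>"))     # predefined entity
print(parses("<p>&eacute;</p>"))  # HTML entity, undefined in plain XML
print(parses("<p>&#xE9;</p>"))    # hexadecimal reference with lowercase x
print(parses("<p>&#XE9;</p>"))    # uppercase X is not valid XML syntax

# A numeric reference is a portable way to write the same character.
print(ET.fromstring("<p>&#xE9;</p>").text == "\u00e9")
```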
XHTML, which is an XML application, supports the HTML 4 entity set and XML's &apos; entity, which doesn't appear in HTML 4. However, use of &apos; in XHTML should generally be avoided for compatibility reasons. A numeric reference such as &#39; may be used instead.
HTML character entity references
For a list of all named HTML character entity references, see List of XML and HTML character entity references (approximately 250 entries).